This paper introduces Monte Carlo *-Minimax Search (MCMS), a Monte Carlo search algorithm for
turn-based, stochastic, two-player, zero-sum games of perfect information. The algorithm is designed
for the class of densely stochastic games; that is, games where one would rarely expect to sample the
same successor state multiple times at any particular chance node. Our approach combines sparse sampling
techniques from MDP planning with classic pruning techniques developed for adversarial expectimax planning.
We compare and contrast our algorithm to the traditional *-Minimax approaches, as well as MCTS
enhanced with Double Progressive Widening, on four games: Pig, EinStein Würfelt Nicht!, Can't Stop,
and Ra. Our results show that MCMS can be competitive with enhanced MCTS variants in some domains,
while consistently outperforming the equivalent classic approaches given the same amount of thinking time.

1 Introduction

Monte Carlo sampling has recently become a popular technique for online planning in large sequential games.
For example, UCT and, more generally, Monte Carlo Tree Search (MCTS) [16, 8] have led to an increase in the
performance of Computer Go players [17], and numerous extensions and applications have since followed [3].
Initially, MCTS was applied to games lacking strong Minimax players, but recently it has been shown to compete
against strong Minimax players in such games [29, 21]. One class of games that has proven more resistant
is stochastic games. Unlike classic games such as Chess and Go, stochastic game trees include chance
nodes in addition to decision nodes. How MCTS should account for this added uncertainty remains unclear.
Moreover, many of the search enhancements from the classic $\alpha\beta$ literature cannot be easily adapted to MCTS.
The classic algorithms for stochastic games, EXPECTIMAX and *-Minimax (Star1 and Star2), perform look-ahead
searches to a limited depth. However, the running time of these algorithms scales exponentially in
the branching factor at chance nodes as the search horizon is increased. Hence, their performance in large
games often depends heavily on the quality of the heuristic evaluation function, as only shallow searches are
possible.

One way to handle the uncertainty at chance nodes would be forward pruning [26], but the performance
gain until now has been small [24]. Another way is to simply sample a single outcome when encountering
a chance node. This is common practice in MCTS when applied to stochastic games. However, the general
performance of this method is unknown. Large stochastic domains still pose a significant challenge. For
instance, MCTS is outperformed by *-Minimax in the game of Carcassonne [12]. Unfortunately, the literature
on the application of Monte Carlo search methods to stochastic games is relatively small.
In this paper, we investigate the use of Monte Carlo sampling in *-Minimax search. We introduce a
new algorithm, Monte Carlo *-Minimax Search (MCMS), which samples a subset of chance node outcomes
in EXPECTIMAX and *-Minimax in stochastic games. In particular, we describe a sampling technique for
chance nodes based on sparse sampling [14] and show that MCMS approaches the optimal decision as the
number of samples grows. We evaluate the practical performance of MCMS in four domains: Pig, EinStein
Würfelt Nicht!, Can't Stop, and Ra. In Pig, we show that the estimates returned by MCMS have lower bias
and lower regret than the estimates returned by the classic *-Minimax algorithms. Finally, we show that
the addition of sampling to *-Minimax can increase its performance from inferior to competitive against
state-of-the-art MCTS, and in the case of Ra, can even perform better than MCTS.

2 Background 

A finite, two-player zero-sum game of perfect information can be described as a tuple $(\mathcal{S}, \mathcal{T}, \mathcal{A}, \mathcal{P}, u_1, s_1)$,
which we now define. The state space $\mathcal{S}$ is a finite, non-empty set of states, with $\mathcal{T} \subseteq \mathcal{S}$ denoting the finite,
non-empty set of terminal states. The action space $\mathcal{A}$ is a finite, non-empty set of actions. The transition
probability function $\mathcal{P}$ assigns to each state-action pair $(s, a) \in \mathcal{S} \times \mathcal{A}$ a probability measure over $\mathcal{S}$ that
we denote by $\mathcal{P}(\cdot \mid s, a)$. The utility function $u_1 : \mathcal{T} \to [v_{\min}, v_{\max}] \subseteq \mathbb{R}$ gives the utility of player 1,
with $v_{\min}$ and $v_{\max}$ denoting the minimum and maximum possible utility, respectively. Since the game is
zero-sum, the utility of player 2 in any state $s \in \mathcal{T}$ is given by $u_2(s) := -u_1(s)$. The player index function
$\tau : \mathcal{S} \setminus \mathcal{T} \to \{1, 2\}$ returns the player to act in a given non-terminal state $s$.

Each game starts in the initial state $s_1$ with $\tau(s_1) := 1$, and proceeds as follows. For each time step
$t \in \mathbb{N}$, player $\tau(s_t)$ selects an action $a_t \in \mathcal{A}$ in state $s_t$, with the next state $s_{t+1}$ generated according to
$\mathcal{P}(\cdot \mid s_t, a_t)$. Player $\tau(s_{t+1})$ then chooses a next action and the cycle continues until some terminal state
$s_T \in \mathcal{T}$ is reached. At this point player 1 and player 2 receive a utility of $u_1(s_T)$ and $u_2(s_T)$ respectively.

2.1 Classic Game Tree Search 

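We now describe the two main search paradigms for adversarial stochastic game tree search. We begin by
describing classic stochastic search techniques, which differ from modern approaches in that they do not
use Monte Carlo sampling. This requires recursively defining the minimax value of a state $s \in \mathcal{S}$, which is
given by

\[
V(s) = \begin{cases}
\max_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \mathcal{P}(s' \mid s, a)\, V(s') & \text{if } s \notin \mathcal{T},\ \tau(s) = 1 \\
\min_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} \mathcal{P}(s' \mid s, a)\, V(s') & \text{if } s \notin \mathcal{T},\ \tau(s) = 2 \\
u_1(s) & \text{otherwise.}
\end{cases}
\]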

Note that here we always treat player 1 as the player maximizing $u_1(s)$ (Max), and player 2 as the player
minimizing $u_1(s)$ (Min). In most large games, computing the minimax value for a given game state is
intractable. Because of this, an often used approximation is to instead compute the depth $d$ minimax value.
This requires limiting the recursion to some fixed depth $d \in \mathbb{N}$ and applying a heuristic evaluation function
when this depth limit is reached. Thus, given a heuristic evaluation function $h : \mathcal{S} \to [v_{\min}, v_{\max}] \subseteq \mathbb{R}$
defined with respect to player 1 that satisfies the requirement $h(s) = u_1(s)$ when $s \in \mathcal{T}$, the depth $d$
minimax value is defined recursively by

\[
V_d(s) = \begin{cases}
\max_{a \in \mathcal{A}} V_d(s, a) & \text{if } d > 0,\ s \notin \mathcal{T}, \text{ and } \tau(s) = 1 \\
\min_{a \in \mathcal{A}} V_d(s, a) & \text{if } d > 0,\ s \notin \mathcal{T}, \text{ and } \tau(s) = 2 \\
h(s) & \text{otherwise,}
\end{cases}
\]

where

\[
V_d(s, a) = \sum_{s' \in \mathcal{S}} \mathcal{P}(s' \mid s, a)\, V_{d-1}(s'). \tag{1}
\]

Figure 1: An example of the Star1 algorithm.

For sufficiently large $d$, $V_d(s)$ coincides with $V(s)$. The quality of the approximation depends on both the
heuristic evaluation function and the search depth parameter $d$.
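The recursion for the depth $d$ minimax value can be sketched directly in code. The following is a minimal illustration only, not the authors' implementation: the toy game (a running score that player 1 pushes up and player 2 pushes down, with hypothetical actions `safe` and `risky`), the state encoding, and all function names are assumptions made for this example.

```python
from typing import List, Tuple

State = Tuple[int, int]          # (score from player 1's view, player to act)
ACTIONS = ("safe", "risky")      # hypothetical action set
V_MIN, V_MAX = -3, 3

def is_terminal(s: State) -> bool:
    return abs(s[0]) >= 3

def h(s: State) -> float:
    # Heuristic evaluation; equals the (clipped) utility at terminal states.
    return float(max(V_MIN, min(V_MAX, s[0])))

def outcomes(s: State, a: str) -> List[Tuple[float, State]]:
    """Chance node after (s, a): list of (probability, successor) pairs."""
    score, player = s
    sign = 1 if player == 1 else -1   # player 2 pushes the score down
    nxt = 3 - player                  # players alternate in this toy game
    if a == "safe":
        return [(1.0, (score + sign, nxt))]
    return [(0.5, (score + 2 * sign, nxt)), (0.5, (score - sign, nxt))]

def q_value(s: State, a: str, d: int) -> float:
    # Equation (1): expectation of the depth d-1 value over successors.
    return sum(p * value(s2, d - 1) for p, s2 in outcomes(s, a))

def value(s: State, d: int) -> float:
    # Depth d minimax value: max for player 1, min for player 2.
    if d == 0 or is_terminal(s):
        return h(s)
    qs = [q_value(s, a, d) for a in ACTIONS]
    return max(qs) if s[1] == 1 else min(qs)
```

Note that every chance outcome is enumerated at every level, which is the source of the exponential dependence on the chance branching factor.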

A direct computation of $\arg\max_{a \in \mathcal{A}(s)} V_d(s, a)$ or $\arg\min_{a \in \mathcal{A}(s)} V_d(s, a)$ is equivalent to running the
well known EXPECTIMAX algorithm [19]. The base EXPECTIMAX algorithm can be enhanced by a technique
similar to $\alpha\beta$ pruning [15] for deterministic game tree search. This involves correctly propagating the $[\alpha, \beta]$
bounds and performing an additional pruning step at each chance node. This pruning step is based on the
observation that if the minimax value has already been computed for a subset of successors $\tilde{\mathcal{S}} \subseteq \mathcal{S}$, the depth
$d$ minimax value of state-action pair $(s, a)$ must lie within

\[
L_d(s, a) \le V_d(s, a) \le U_d(s, a),
\]

where

\[
L_d(s, a) = \sum_{s' \in \tilde{\mathcal{S}}} \mathcal{P}(s' \mid s, a)\, V_{d-1}(s') + \sum_{s' \in \mathcal{S} \setminus \tilde{\mathcal{S}}} \mathcal{P}(s' \mid s, a)\, v_{\min},
\]
\[
U_d(s, a) = \sum_{s' \in \tilde{\mathcal{S}}} \mathcal{P}(s' \mid s, a)\, V_{d-1}(s') + \sum_{s' \in \mathcal{S} \setminus \tilde{\mathcal{S}}} \mathcal{P}(s' \mid s, a)\, v_{\max}.
\]

These bounds form the basis of the pruning mechanisms in the *-Minimax [2] family of algorithms. In the
Star1 algorithm, each $s'$ from the equations above represents the state reached after a particular outcome is
applied at a chance node following $(s, a)$. In practice, Star1 maintains lower and upper bounds on $V_{d-1}(s')$
for each child $s'$ at chance nodes, using this information to stop the search when it finds a proof that any
future search is pointless.
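As a small illustration of these bounds, the sketch below computes the lower and upper bound of a chance node from a partially evaluated set of successors. The function name and the use of `None` to mark a successor that has not been searched yet are assumptions of this example, not part of the original algorithm description.

```python
from typing import List, Optional, Tuple

def chance_bounds(probs: List[float],
                  values: List[Optional[float]],
                  v_min: float,
                  v_max: float) -> Tuple[float, float]:
    """Return (L, U) for a chance node.

    probs[i]  is the probability of successor i.
    values[i] is its depth d-1 value, or None if it has not been searched.
    Unsearched successors contribute v_min to L and v_max to U.
    """
    lower = sum(p * (v if v is not None else v_min) for p, v in zip(probs, values))
    upper = sum(p * (v if v is not None else v_max) for p, v in zip(probs, values))
    return lower, upper

# One of three successors searched so far (value 4.0), utilities in [-10, 10].
print(chance_bounds([0.5, 0.25, 0.25], [4.0, None, None], -10.0, 10.0))
```

Once every successor has been searched, the two bounds coincide with the exact expectation.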

To better understand when cutoffs occur in *-Minimax, we now present an example adapted from Ballard's
original paper. Consider Figure 1. The algorithm recurses down from state $s$ with a window of
$[\alpha, \beta] = [4, 5]$ and encounters a chance node. Without having searched any of the children, the bounds
for the values returned are $(v_{\min}, v_{\max}) = (-10, +10)$. The subtree of a child, say $s'$, is searched and returns
$V_{d-1}(s') = 2$. Since this is now known, the upper and lower bounds for that outcome become 2. The lower
bound on the minimax value of the chance node becomes $(2 - 10 - 10)/3$ and the upper bound becomes
$(2 + 10 + 10)/3$, assuming a uniform distribution over chance events. If ever the lower bound on the value
of the chance node exceeds or equals $\beta$, or if the upper bound for the chance node is less than or equal to $\alpha$,
the subtree is pruned. In addition, this bound information can be used to narrow the search window used to
evaluate later child nodes.

 1  Star1(s, a, d, α, β)
 2    if d = 0 or s ∈ T then return h(s)
 3    else
 4      O ← genOutcomeSet(s, a)
 5      for o ∈ O do
 6        α' ← childAlpha(o, α)
 7        β' ← childBeta(o, β)
 8        s' ← actionChanceEvent(s, a, o)
 9        v ← alphabeta1(s', d - 1, α', β')
10        o_l ← v; o_u ← v
11        if v ≥ β' then return pess(O)
12        if v ≤ α' then return opti(O)
13      return V_d(s, a)

Algorithm 1: Star1

The algorithm is summarized in Algorithm 1. The alphabeta1 procedure recursively calls Star1. The
outcome set $O$ is an array of tuples, one per outcome. One such tuple $o$ has three attributes: a lower bound
$o_l$ initialized to $v_{\min}$, an upper bound $o_u$ initialized to $v_{\max}$, and the outcome's probability $o_p$. The pess
function returns the current lower bound on the chance node, $\mathrm{pess}(O) = \sum_{o \in O} o_p o_l$. Similarly, opti returns
the current upper bound on the chance node using $o_u$ in place of $o_l$: $\mathrm{opti}(O) = \sum_{o \in O} o_p o_u$. Finally, the
functions childAlpha and childBeta return the new bounds on the value of the respective child below.
Continuing the example above, suppose the algorithm is ready to descend down the middle outcome. The
lower bound for the child is derived from the equation $(2 + o_p \alpha' + 10)/3 = \alpha$. Solving for $\alpha'$ here gives
$\alpha' = (3\alpha - 12)/o_p$. In general:

\[
\alpha' = \max\left\{ v_{\min},\ \frac{\alpha - \mathrm{opti}(O) + o_p o_u}{o_p} \right\},
\qquad
\beta' = \min\left\{ v_{\max},\ \frac{\beta - \mathrm{pess}(O) + o_p o_l}{o_p} \right\}.
\]
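The bound bookkeeping and the child window computation can be sketched as follows. This is an illustrative sketch only: the `Outcome` class and the function names are assumptions made here, and exact rational arithmetic (`fractions.Fraction`) is used simply to keep the worked numbers exact. The usage lines reproduce the running example with window $[4, 5]$, utilities in $[-10, 10]$, three equally likely outcomes, and a first child value of 2.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import List

@dataclass
class Outcome:
    p: Fraction    # o_p, probability of the outcome
    lo: Fraction   # o_l, current lower bound on the child's value
    hi: Fraction   # o_u, current upper bound on the child's value

def pess(O: List[Outcome]) -> Fraction:
    # Current lower bound on the chance node.
    return sum((o.p * o.lo for o in O), Fraction(0))

def opti(O: List[Outcome]) -> Fraction:
    # Current upper bound on the chance node.
    return sum((o.p * o.hi for o in O), Fraction(0))

def child_alpha(O: List[Outcome], o: Outcome, alpha, v_min):
    # Smallest child value for which the node could still exceed alpha.
    return max(Fraction(v_min), (alpha - opti(O) + o.p * o.hi) / o.p)

def child_beta(O: List[Outcome], o: Outcome, beta, v_max):
    # Largest child value for which the node could still stay below beta.
    return min(Fraction(v_max), (beta - pess(O) + o.p * o.lo) / o.p)

third = Fraction(1, 3)
O = [Outcome(third, Fraction(2), Fraction(2)),      # already searched, value 2
     Outcome(third, Fraction(-10), Fraction(10)),   # middle outcome, unsearched
     Outcome(third, Fraction(-10), Fraction(10))]   # last outcome, unsearched

print(pess(O), opti(O))               # chance node bounds: -6 and 22/3
print(child_alpha(O, O[1], 4, -10))   # window lower bound for the middle child
print(child_beta(O, O[1], 5, 10))     # window upper bound, clipped to v_max
```

For this example the middle child is searched with the window $[0, 10]$: the unclipped upper bound would be 23, which exceeds $v_{\max}$.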



The performance of the algorithm can be improved significantly by applying a simple look-ahead heuristic.
Suppose the algorithm encounters a chance node. When searching the children of each outcome, one
can temporarily restrict the legal actions at a successor (decision) node. If only a single action is searched
at the successor, then the value returned will be a bound on $V_{d-1}(s')$. If the successor is a Max node, then
the true value can only be larger, and hence the value returned is a lower bound. Similarly, if it is a Min
node, the true value can only be smaller, and the value returned is an upper bound. The Star2 algorithm
applies this idea via a preliminary probing
phase at chance nodes in hopes of pruning without requiring full search of the children. If probing does not
lead to a cutoff, then the children are fully searched, but bound information collected in the probing phase
can be re-used. When moves are appropriately ordered, the algorithm can often choose the best single move
and effectively cause a cut-off with much less search effort. Since this idea is applied recursively, the benefits
compound as the depth increases. The algorithm is summarized in Algorithm 2. The alphabeta2 procedure
is analogous to alphabeta1 except that when p is true, a subset (of size one) of the actions is considered
at the next decision node. The recursive calls to Star2 within alphabeta2 have p set to false and a set to
the chosen action.

 1  Star2(s, a, d, α, β)
 2    if d = 0 or s ∈ T then return h(s)
 3    else
 4      O ← genOutcomeSet(s, a)
 5      for o ∈ O do
 6        α' ← childAlpha(o, α)
 7        β' ← childBeta(o, β)
 8        s' ← actionChanceEvent(s, a, o)
 9        v ← alphabeta2(s', d - 1, α', β', true)
10        if τ(s') = 1 then
11          o_l ← v
12          if pess(O) ≥ β then return pess(O)
13        else if τ(s') = 2 then
14          o_u ← v
15          if opti(O) ≤ α then return opti(O)
16      for o ∈ O do
17        α' ← childAlpha(o, α)
18        β' ← childBeta(o, β)
19        s' ← actionChanceEvent(s, a, o)
20        v ← alphabeta2(s', d - 1, α', β', false)
21        o_l ← v; o_u ← v
22        if v ≥ β' then return pess(O)
23        if v ≤ α' then return opti(O)
24      return V_d(s, a)

Algorithm 2: Star2

Star1 and Star2 are typically presented using the negamax formulation. In fact, Ballard originally restricted
his discussion to regular *-Minimax trees, ones that strictly alternate Max, Chance, Min, Chance. We
intentionally present the more general $\alpha\beta$ formulation here because it handles a specific case encountered by
three of our test domains. In games where the outcome of a chance node determines the next player to play,
the cut criterion during the Star2 probing phase depends on the child node. The bound established by the Star2
probing phase will be either a lower bound or an upper bound, depending on the child's type. This distinction
is made in lines 10 to 15. Also note that, when implementing the algorithm, it is advisable for better performance
to compute the bound information incrementally [11].



2.2 Monte Carlo Tree Search 

Monte Carlo Tree Search (MCTS) has attracted significant attention in recent years. The main idea is to 
iteratively run simulations from the game's current position to a leaf, incrementally growing a tree rooted at 
the current position. In its simplest form, the tree is initially empty, with each simulation expanding the tree 
by an additional node. When this node is not terminal, a rollout policy takes over and chooses actions until 
a terminal state is reached. Upon reaching a terminal state, the observed utility is back-propagated through 
all the nodes visited in this simulation, which causes the value estimates to become more accurate over time. 
This idea of using random rollouts to estimate the value of individual positions has proven successful in Go 
and many other domains [8, 3].

While descending through the tree, a sequence of actions must be selected for further exploration. A
popular way to do this so as to balance between exploration and exploitation is to use algorithms developed for
the well-known stochastic multi-armed bandit problem [1]. UCT is an algorithm that recursively applies one
of these selection mechanisms to trees [16]. An improvement of significant practical importance is progressive
unpruning / widening [7, 5]. The main idea is to purposely restrict the number of allowed actions, with
this restriction being slowly relaxed so that the tree grows deeper at first and then slowly wider over time.
Progressive widening has also been extended to include chance nodes, leading to the Double Progressive
Widening algorithm (DPW) [6]. When DPW encounters a chance or decision node, it computes a maximum
number of actions or outcomes to consider, $k = \lceil C v^{\alpha} \rceil$, where $C$ and $\alpha$ are parameter constants and $v$
represents the number of visits to the node. At a decision node, only the first $k$ actions from the action set
are available. At a chance node, a set of outcomes is stored and incrementally grown. An outcome is sampled;
if $k$ is larger than the size of the current set of outcomes and the newly sampled outcome is not in the set, it is
added to the set. Otherwise, DPW samples from existing children at chance nodes in the tree, where a child's
probability is computed with respect to the current children in the restricted set. This enhancement has been
shown to improve the performance of MCTS in densely stochastic games.
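The widening rule at a chance node can be sketched as below. This is a hedged illustration of the description above rather than a reference DPW implementation: the function names, the dictionary used to store children, and the `sample_outcome` callback (which returns an outcome together with its probability under the true model) are assumptions of this example, and the sketch assumes `visits >= 1` and `C > 0` so that at least one outcome is always admitted.

```python
import math
import random
from typing import Callable, Dict, Hashable, Tuple

def dpw_limit(C: float, alpha: float, visits: int) -> int:
    # Maximum number of actions or outcomes to consider: k = ceil(C * v^alpha).
    return math.ceil(C * visits ** alpha)

def dpw_sample_outcome(children: Dict[Hashable, float],
                       visits: int,
                       sample_outcome: Callable[[random.Random], Tuple[Hashable, float]],
                       C: float,
                       alpha: float,
                       rng: random.Random) -> Hashable:
    """Pick the outcome to follow at a chance node, growing `children` if allowed.

    `children` maps each stored outcome to its probability under the true model.
    """
    k = dpw_limit(C, alpha, visits)
    outcome, prob = sample_outcome(rng)
    if k > len(children) and outcome not in children:
        children[outcome] = prob          # widen: store the new outcome
        return outcome
    # Otherwise sample among stored children; random.choices normalizes the
    # weights, i.e. probabilities are taken relative to the restricted set.
    stored = list(children)
    weights = [children[o] for o in stored]
    return rng.choices(stored, weights=weights, k=1)[0]

def die(rng: random.Random) -> Tuple[int, float]:
    # Hypothetical chance model: one fair six-sided die.
    return rng.randint(1, 6), 1.0 / 6.0

rng = random.Random(0)
children: Dict[Hashable, float] = {}
for visits in range(1, 201):
    dpw_sample_outcome(children, visits, die, C=1.0, alpha=0.3, rng=rng)
print(len(children), "outcomes stored after 200 visits")
```

With a small exponent the set of stored outcomes grows slowly, so early simulations revisit the same few children and the tree grows deeper before it grows wider.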

2.3 Sampling in Markov Decision Processes 

Computing optimal policies in large Markov Decision Processes (MDPs) is a significant challenge. Since the
size of the state space is often exponential in the properties describing each state, much work has focused
on finding efficient methods to compute approximately optimal solutions. One way to do this, given only a
generative model of the domain, is to employ sparse sampling [14]. When faced with a decision to make from
a particular state, a local sub-MDP can be built using fixed depth search. When transitioning to successor
states, a fixed number $c \in \mathbb{N}$ of successor states are sampled for each action. Kearns et al. showed that for
an appropriate choice of $c$, this procedure produces value estimates that are accurate with high probability.
Importantly, $c$ was shown to have no dependence on the number of states $|\mathcal{S}|$, effectively breaking the curse of
dimensionality. This method of sparse sampling was later improved by using adaptive decision rules based on
the multi-armed bandit literature to give the AMS algorithm [4]. Also, the Forward Search Sparse Sampling
(FSSS) [28] algorithm was recently introduced, which exploits bound information to add a form of sound
pruning to sparse sampling. The branch and bound pruning mechanism used by FSSS works similarly to
Star1 in adversarial domains.



3 Sparse Sampling in Adversarial Games 

The practical performance of classic game tree search algorithms such as Star1 or Star2 strongly depends
on the typical branching factor at chance nodes. Since this can be as bad as $|\mathcal{S}|$, long-term planning using
classic techniques is often infeasible in stochastic domains. However, as with sparse sampling for MDPs in
Section 2.3, this dependency can be removed by an appropriate use of Monte Carlo sampling. We now define
the estimated depth $d$ minimax value as

\[
\hat{V}_d(s) := \begin{cases}
\max_{a \in \mathcal{A}} \hat{V}_d(s, a) & \text{if } d > 0,\ s \notin \mathcal{T}, \text{ and } \tau(s) = 1 \\
\min_{a \in \mathcal{A}} \hat{V}_d(s, a) & \text{if } d > 0,\ s \notin \mathcal{T}, \text{ and } \tau(s) = 2 \\
h(s) & \text{otherwise,}
\end{cases}
\]

where

\[
\hat{V}_d(s, a) := \frac{1}{c} \sum_{i=1}^{c} \hat{V}_{d-1}(s_i),
\]

for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$, with each successor state $s_i$ distributed according to $\mathcal{P}(\cdot \mid s, a)$ for $1 \le i \le c$. This
natural definition can be justified by the following result, which shows that the value estimates are accurate
with high probability, provided $c$ is chosen to be sufficiently large.
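The estimated depth $d$ minimax value only needs a generative model, that is, a way to draw one successor of a state-action pair. The sketch below illustrates this on a hypothetical toy game (a running score that player 1 pushes up and player 2 pushes down); the game, the function names, and the parameter values are assumptions made for this example and are not taken from the experiments.

```python
import random

ACTIONS = ("safe", "risky")      # hypothetical action set
V_MIN, V_MAX = -3, 3

def is_terminal(s):
    return abs(s[0]) >= 3

def h(s):
    # Heuristic evaluation; equals the clipped utility at terminal states.
    return float(max(V_MIN, min(V_MAX, s[0])))

def sample_next(s, a, rng):
    """Generative model: draw one successor of (s, a)."""
    score, player = s
    sign = 1 if player == 1 else -1
    delta = 1 if a == "safe" else (2 if rng.random() < 0.5 else -1)
    return (score + sign * delta, 3 - player)

def exp_ss(s, d, c, rng):
    """Estimated depth d minimax value using c samples per chance node."""
    if d == 0 or is_terminal(s):
        return h(s)
    qs = []
    for a in ACTIONS:
        # Sample c successors with replacement and average their estimates.
        samples = [exp_ss(sample_next(s, a, rng), d - 1, c, rng) for _ in range(c)]
        qs.append(sum(samples) / c)
    return max(qs) if s[1] == 1 else min(qs)

rng = random.Random(0)
print(exp_ss((0, 1), d=3, c=5, rng=rng))
```

The work per chance node is $c$ recursive calls regardless of how many distinct successors the true distribution has, which is the point of the construction.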

Theorem 1. Given $c \in \mathbb{N}$, for any state $s \in \mathcal{S}$, for all $\lambda \in (0, 2v_{\max}] \subset \mathbb{R}$, for any depth $d \in \mathbb{Z}_+$,

\[
\mathbb{P}\left( \left| \hat{V}_d(s) - V_d(s) \right| \le \lambda d \right) \ge 1 - (2c|\mathcal{A}|)^d \exp\left\{ -\frac{\lambda^2 c}{2 v_{\max}^2} \right\}.
\]

The proof is a straightforward generalization of the result of [14] for finite horizon, adversarial games, and
is included in Appendix A. Notice that although there is no dependence on $|\mathcal{S}|$, there is still an exponential
dependence on the horizon $d$. Thus an enormously large value of $c$ will need to be used to obtain any
meaningful theoretical guarantees. Nevertheless, we shall show later that surprisingly small values of $c$
perform well in practice. Also note that our proof of Theorem 1 does not hold when sampling without
replacement is used. Investigating whether the analysis can be extended to cover this case would be an
interesting next step.
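To get a feeling for how large $c$ must be before the guarantee becomes non-vacuous, the bound can be evaluated numerically. The sketch below reads the right-hand side of Theorem 1 as $1 - (2c|\mathcal{A}|)^d \exp\{-\lambda^2 c / (2 v_{\max}^2)\}$, with the exponent taken to match Lemma A.1; the function name and the parameter values in the usage lines are illustrative assumptions, not settings from the experiments.

```python
import math

def theorem1_bound(c: int, n_actions: int, d: int, lam: float, v_max: float) -> float:
    """Lower bound on P(|estimate - V_d(s)| <= lam * d) from Theorem 1."""
    # Work in log space so that large (2 c |A|)^d terms do not overflow.
    log_fail = d * math.log(2 * c * n_actions) - (lam ** 2) * c / (2 * v_max ** 2)
    if log_fail > 700:
        return float("-inf")     # exp would overflow; the bound is vacuous anyway
    return 1.0 - math.exp(log_fail)

# Hypothetical setting: utilities in [-100, 100], two actions, depth 3, lam = 10.
for c in (100, 1000, 10000):
    print(c, theorem1_bound(c, n_actions=2, d=3, lam=10.0, v_max=100.0))
```

A negative value means the bound says nothing for that sample width. In this hypothetical setting the guarantee only becomes informative for $c$ in the thousands, far above the sample widths that work well in practice.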

3.1 Monte Carlo *-Minimax 

We are now in a position to describe the MCMS family of algorithms, which compute estimated depth $d$
minimax values by recursively applying one of the Star1 or Star2 pruning rules. The MCMS variants can be
easily described in terms of the previous descriptions of the original Star1 and Star2 algorithms. To enable
sampling, one need only change the implementation of genOutcomeSet on line 4 of Algorithm 1 and
line 4 of Algorithm 2. At a chance node, instead of recursively visiting the subtrees under each outcome, $c$
outcomes are sampled with replacement and only the subtrees under those outcomes are visited; the value
returned to the parent is the (equally weighted) average of the $c$ samples. Equivalently, one can view this
approach as transforming each chance node into a new chance node with $c$ outcomes, each having probability
$1/c$. We call these new variants star1SS and star2SS. If all pruning is disabled, we obtain EXPECTIMAX with
sparse sampling (expSS), which computes $\hat{V}_d(s)$ directly from its definition. At a fixed depth, if both algorithms
sample identically, the star1SS method computes exactly the same value as expSS but avoids useless work
by using the Star1 pruning rule. The case of star2SS is slightly more complicated. For Theorem 1 to apply,
the bound information collected in the probing phase needs to be consistent with the bound information used
after the probing phase. To ensure this, the algorithm must sample outcomes identically in the subtrees taken
while probing and afterward.
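A compact sketch of the Star1 variant with sampling is given below, together with an unpruned reference that uses the same samples. It is an illustration under several assumptions and not the authors' implementation: the toy game and all names are hypothetical; identical sampling is obtained by seeding a generator on the tuple (seed, state, action, depth); exact rational arithmetic is used so the two values can be compared for equality; and the cutoff tests compare pess(O) and opti(O) against $\beta$ and $\alpha$ directly, following the pruning condition stated in the text rather than the child-window tests of Algorithm 1.

```python
import random
from dataclasses import dataclass
from fractions import Fraction

ACTIONS = ("safe", "risky")          # hypothetical toy game
V_MIN, V_MAX = -3, 3

def is_terminal(s):
    return abs(s[0]) >= 3

def h(s):
    return max(V_MIN, min(V_MAX, s[0]))

@dataclass
class Outcome:
    state: tuple
    p: Fraction
    lo: object = V_MIN               # o_l
    hi: object = V_MAX               # o_u

def gen_outcome_set(s, a, d, c, seed):
    """Sample c successors of (s, a) with replacement, each weighted 1/c.
    Seeding on (seed, s, a, d) makes every algorithm see the same samples."""
    rng = random.Random(hash((seed, s, a, d)))
    score, player = s
    sign = 1 if player == 1 else -1
    out = []
    for _ in range(c):
        delta = 1 if ACTIONS[a] == "safe" else (2 if rng.random() < 0.5 else -1)
        out.append(Outcome((score + sign * delta, 3 - player), Fraction(1, c)))
    return out

def pess(O):
    return sum(o.p * o.lo for o in O)

def opti(O):
    return sum(o.p * o.hi for o in O)

def star1_ss(s, a, d, alpha, beta, c, seed):
    O = gen_outcome_set(s, a, d, c, seed)
    for o in O:
        a_child = max(V_MIN, (alpha - opti(O) + o.p * o.hi) / o.p)
        b_child = min(V_MAX, (beta - pess(O) + o.p * o.lo) / o.p)
        v = alphabeta1(o.state, d - 1, a_child, b_child, c, seed)
        o.lo = o.hi = v
        if pess(O) >= beta:          # node value is provably >= beta
            return pess(O)
        if opti(O) <= alpha:         # node value is provably <= alpha
            return opti(O)
    return pess(O)                   # all samples searched: the exact average

def alphabeta1(s, d, alpha, beta, c, seed):
    if d == 0 or is_terminal(s):
        return h(s)
    if s[1] == 1:                    # Max node
        best = V_MIN
        for a in range(len(ACTIONS)):
            best = max(best, star1_ss(s, a, d, max(alpha, best), beta, c, seed))
            if best >= beta:
                break
        return best
    best = V_MAX                     # Min node
    for a in range(len(ACTIONS)):
        best = min(best, star1_ss(s, a, d, alpha, min(beta, best), c, seed))
        if best <= alpha:
            break
    return best

def exp_ss(s, d, c, seed):
    """Unpruned reference using exactly the same samples."""
    if d == 0 or is_terminal(s):
        return h(s)
    qs = [sum(o.p * exp_ss(o.state, d - 1, c, seed)
              for o in gen_outcome_set(s, a, d, c, seed))
          for a in range(len(ACTIONS))]
    return max(qs) if s[1] == 1 else min(qs)

print(alphabeta1((0, 1), 4, V_MIN, V_MAX, 3, seed=0), exp_ss((0, 1), 4, 3, seed=0))
```

With a full window at the root, the pruned and unpruned searches return the same value; the pruned search simply skips subtrees that cannot change it.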

4 Empirical Evaluation 

We now describe our experiments. We start with our domains: Pig, EinStein Würfelt Nicht!, Can't Stop, and
Ra. We then describe our experimental setup in detail, followed by two experiments: one to determine
the individual performance of each algorithm, and one to compute the statistical properties of the underlying
estimators.

4.1 Domains 

Pig is a two-player dice game [23]. Players each start with 0 points; the goal is to be the first player to achieve
100 or more points. Each turn, players roll two dice and then, if there are no 1s showing, add the sum to their
turn total. At each decision point, a player may continue to roll or stop. If they decide to stop, they add their
turn total to their total score and then it becomes the opponent's turn. Otherwise, they roll the dice again for a
chance to continue adding to their turn total. If a single 1 is rolled, the turn total is reset and the turn
ends (no points gained); if two 1s are rolled, the player's turn ends and their total score is
reset to 0.
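The chance node that follows a roll decision in this variant of Pig can be enumerated exactly. The sketch below assumes that the penalized die face is a 1, as in the standard two-dice variant of the game; the outcome labels `lose_turn` and `reset_score` and the function name are hypothetical names chosen for this example.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

def pig_roll_distribution():
    """Distribution over the outcomes of rolling two fair dice in Pig."""
    dist = defaultdict(Fraction)
    for d1, d2 in product(range(1, 7), repeat=2):
        if d1 == 1 and d2 == 1:
            key = "reset_score"          # two 1s: turn ends, total score reset
        elif d1 == 1 or d2 == 1:
            key = "lose_turn"            # single 1: turn total lost, turn ends
        else:
            key = ("add", d1 + d2)       # add the sum to the turn total
        dist[key] += Fraction(1, 36)
    return dict(dist)

dist = pig_roll_distribution()
for outcome, prob in sorted(dist.items(), key=str):
    print(outcome, prob)
```

Grouped by their effect on the game state, a roll has eleven distinct outcomes with clearly non-uniform probabilities.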

EinStein Würfelt Nicht! (EWN) is a game played on a 5 by 5 square board. Players start with six dice
used as pieces (1, 2, ..., 6) in opposing corners of the board. The goal is to reach the opponent's corner
square with a single die or capture every opponent piece. Each turn starts with the player rolling a neutral
six-sided die whose result indicates which one of their pieces (dice) can move this turn. Then the player must
move a piece toward the opponent's corner base (or off the board). Whenever a piece moves onto a square
occupied by another piece, that piece is captured. EWN is a game played by humans and computer opponents
on the Little Golem online board game site; at least two MCTS players have been developed to play it [18, 25].

Can't Stop is a dice game [22] that is very popular on online gaming sites. Can't Stop has also been a
domain of interest to AI researchers [10, 9]. The goal is to obtain three complete columns by reaching the
highest level in each of the 2-12 columns. This is done by repeatedly rolling 4 dice and playing zero or more
pairing combinations. Once a pairing combination is played, a marker is placed on the associated column
and moved upwards. Only three distinct columns can be used during any given turn. If dice are rolled and no
legal pairing combination can be made, the player loses all of the progress made towards completing columns
on this turn. After rolling and making a legal pairing, a player can choose to lock in their progress by ending
their turn.

Ra is a set collection bidding game, currently ranked #58 highest board game (out of several thousand)
on the community site BoardGameGeek.com. Players collect various combinations of tiles by winning
auctions using the bidding tokens (suns). Each turn, a player chooses to either draw a tile from the bag or
start an auction. When a special Ra tile is drawn, an auction starts immediately, and players use one of their
suns to bid on the current group of tiles. By winning an auction, a player takes the current set of tiles and
exchanges the winning sun with the one in the middle of the board, the one gained becoming inactive until
the following round (epoch). When a player no longer has any active suns, they cannot take their turns until
the next epoch. Points are attributed to each player at the end of each epoch depending on their tile set as
well as the tile sets of other players.

Table 1: Mean statistical property values over 2470 Pig states.

Algorithm   MSE     Variance   Bias    Regret
MCTS        78.7    0.71       8.83    0.41
DPW         79.4    5.3        8.61    0.96
exp         91.4    0.037      9.56    0.56
Star1       91.0    0.064      9.54    0.55
Star2       87.9    0.008      9.38    ...
star1SS     ...     ...        ...     ...
...         99.8    ...        ...     ...

Figure 2: Results of playing strength experiments. Each bar represents the percentage of wins for $p_{left}$
in a $p_{left}$-$p_{right}$ pairing. (Positions are swapped and this notation refers only to the name order.) Error
bars represent 95% confidence intervals. Here, the best variant of MCTS is used in each domain. exp-MCTS,
expSS-MCTS, Star2-MCTS, and star2SS-expSS are intentionally omitted since they look similar to
Star1-MCTS, star1SS-MCTS, Star1-MCTS, and star1SS-expSS, respectively.



4.2 Experimental Setup 

In our implementation, low-overhead static move orderings are used to enumerate actions. Iterative deepening 
is used so that when a timeout occurs, if a move at the root has not been fully searched, then the best move 
from the previous depth search is returned. Transposition tables are used to store the best move to improve 
move ordering for future searches. In addition, to account for the extra overhead of maintaining bound 
information, pruning is ignored at search depths 2 or lower. In MCTS, chance nodes are stored in the tree 
and the selection policy always samples an outcome based on their probability distributions, which are non- 
uniform in every case except EWN. 

Our experiments use a search time limit of 200 milliseconds. MCTS uses utilities in $[-100, 100]$ and a
UCT exploration constant of $C_1$. Since evaluation functions are available, we augment MCTS with a parameter,
$d_r$, representing the number of moves taken by the rollout policy before the evaluation function
is called. MCTS with double-progressive widening (DPW) uses two more parameters, $C_2$ and $\alpha$, described
in Section 2.2. Each algorithm's parameters are tuned via self-play tournaments where each player in the
tournament represents a specific parameter set from a range of possible parameters and seats are swapped to
ensure fairness. Specifically, we used a multi-round elimination style tournament where each head-to-head pairing
consisted of 1000 games (500 swapped seat matches) between two different sets of parameters, with winners
continuing to the next round and the final champion determining the optimal parameter values. By repeating the
tournaments, we found this elimination style tuning to be more consistent than round-robin style tournaments,
even with a larger total number of games. The sample widths for (expSS, star1SS, star2SS) in Pig were found
to be (20, 25, 18). In EWN, Can't Stop, and Ra, they were found to be (1, 1, 2), (25, 30, 15), and (5, 5, 2),
respectively. In MCTS and DPW, the optimal parameters ($C_1$, $d_r$, $C_2$, $\alpha$) in Pig were found to be (50, 0, 5, 0.2).
In EWN, Can't Stop, and Ra, they were found to be (200, 100, 4, 0.25), (50, 10, 25, 0.3), and (50, 0, 2, 0.1),
respectively. The values of $d_r$ imply that the quality of the evaluation function in EWN is significantly lower
than in other games.

4.3 Statistical Properties 

Our first experiment compares statistical properties of the estimates and actions returned by *-Minimax,
MCMS, and MCTS. At a single decision point $s$, each algorithm acts as an estimator of the true minimax value
$V(s)$, and returns the action $a \in \mathcal{A}$ that maximizes $\hat{V}(s, a)$. Since Pig has fewer than one million states, we
solve it using value iteration, which has been applied to previous smaller games of Pig [20],
obtaining the true value of each state $V(s)$. From this, we estimate the mean squared error, variance, bias,
and regret of each algorithm using $\mathrm{MSE}[\hat{V}(s)] = \mathbb{E}[(\hat{V}(s) - V(s))^2] = \mathrm{Var}[\hat{V}(s)] + \mathrm{Bias}(V(s), \hat{V}(s))^2$
by running each algorithm 50 separate times at each decision point. Then we compute the regret of taking
action $a$ at state $s$, $\mathrm{Regret}(s, a) = V(s) - V(s, a)$, where $a$ is the action chosen by the algorithm from state $s$.
As with MSE, variance, and bias, for a state $s$ we estimate $\mathrm{Regret}(s, a)$ by computing a mean over 50 runs
starting at $s$. The estimates of these properties are computed for each state in a collection of states $s \in \mathcal{S}_{obs}$
observed through simulated games. $\mathcal{S}_{obs}$ is formed by taking every state seen in simulated games in which
each type of player plays against each other type of player, and discarding duplicate states. Therefore, the
states collected represent states that are actually visited during game play. The average value of
each property over these $|\mathcal{S}_{obs}| = 2470$ game states is shown in Table 1.
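The per-state quantities can be computed from repeated runs as sketched below. This is an illustrative sketch only: the function names, the use of the population variance (so that the MSE decomposition holds exactly), and the example numbers are assumptions made here, not the authors' measurement code.

```python
from statistics import fmean
from typing import Dict, Hashable, Sequence, Tuple

def estimator_stats(estimates: Sequence[float], true_value: float) -> Tuple[float, float, float]:
    """Return (MSE, variance, bias) of repeated estimates of one state's value."""
    mean = fmean(estimates)
    bias = mean - true_value
    variance = fmean([(x - mean) ** 2 for x in estimates])     # population variance
    mse = fmean([(x - true_value) ** 2 for x in estimates])
    return mse, variance, bias

def mean_regret(true_value: float,
                true_action_values: Dict[Hashable, float],
                chosen_actions: Sequence[Hashable]) -> float:
    """Average of V(s) - V(s, a) over the actions chosen in repeated runs."""
    return fmean([true_value - true_action_values[a] for a in chosen_actions])

# Hypothetical data for one state: three estimates of a true value of 1.0.
mse, var, bias = estimator_stats([1.0, 2.0, 3.0], true_value=1.0)
print(mse, var, bias, var + bias ** 2)

# Hypothetical true action values and the actions picked in four runs.
print(mean_regret(5.0, {"roll": 5.0, "stop": 3.0}, ["roll", "stop", "roll", "roll"]))
```

Averaging these per-state numbers over all observed states gives table entries of the kind reported for each algorithm.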

The results in the table show the trade-offs between bias and variance. We see that the estimated bias
returned by expSS is lower than that of the classic *-Minimax algorithms. The performance results below may be
explained by this reduction in bias. While variance is introduced due to sampling, seemingly causing higher
MSE, in two of three cases the regret in MCMS is lower than in *-Minimax, ultimately leading to better
performance, as seen in the following section.

4.4 Playing Strength 

In our second experiment, we computed the performance of each algorithm by playing a number of test
matches (5000 for Pig and EWN, 2000 for Can't Stop and Ra) for each paired set of players. Each match
consists of two games where players swap seats and a single randomly generated seed is used for both
games in the match. To determine the best MCTS variant, 500 matches of MCTS versus DPW were played
in each domain, and the winner was chosen (classic MCTS in Pig and EWN, DPW in Can't Stop and Ra). The
performance of each pairing of players is shown in Figure 2.
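A win percentage and a 95% confidence interval for one pairing can be computed as in the sketch below. The text does not state how the intervals were obtained, so the normal approximation to the binomial used here is an assumption, as are the function name and the example counts.

```python
import math
from typing import Tuple

def win_rate_ci(wins: int, games: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Win rate with a normal-approximation confidence interval (z = 1.96 for 95%)."""
    p = wins / games
    half_width = z * math.sqrt(p * (1.0 - p) / games)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical pairing: 5000 matches of two games each, 5600 games won.
rate, low, high = win_rate_ci(5600, 10000)
print(f"{100 * rate:.1f}% [{100 * low:.1f}%, {100 * high:.1f}%]")
```

Because the two games of a match share a seed, the games are not fully independent, so an interval computed this way should be read as approximate.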
The results show that the MCMS variants outperform their equivalent classic counterparts in every case,
establishing a clear benefit of sparse sampling in the *-Minimax algorithm. In some cases, the improvement
is quite significant, such as an 85.0% win rate for star2SS vs Star2 in Can't Stop. MCMS also performs
particularly well in Ra, obtaining roughly 60% wins against its classic *-Minimax counterparts. This indicates
that MCMS is well suited for densely stochastic games. In Pig and Ra, the best MCMS variant seems to
perform comparably to the best variant of MCTS; the weak performance in EWN is likely due to the lack of a
good evaluation function. Nonetheless, when looking at the relative performances of classic *-Minimax, we
see that the performance against MCTS improves when sparse sampling is applied. We also notice that in EWN
expSS slightly outperforms star1SS; this can occur when there are few pruning opportunities and the overhead
added by maintaining the bound information outweighs the benefit of pruning. A similar phenomenon is
observed for star1SS and star2SS in Ra.

The relative performance between expSS, star1SS, and star2SS is less clear. This could be due to the
overhead incurred by maintaining bound information reducing the time saved by sampling; i.e., the benefit
of additional sampling may be greater than the benefit of pruning within the smaller sample. We believe that
the relative performance of MCMS could improve with the addition of domain knowledge such as classic
search heuristics and specially tuned evaluation functions that lead to more pruning opportunities, but more
work is required to show this.

5 Conclusion and Future Work 

This paper has introduced MCMS, a family of sparse sampling algorithms for two-player, perfect information, 
stochastic, adversarial games. Our results show that MCMS can be competitive against MCTS variants in 
some domains, while consistently outperforming the equivalent classic approaches given the same amount 
of thinking time. We feel that our initial results are encouraging, and worthy of further investigation. One 
particularly attractive property of MCMS compared with MCTS (and variants) is the ease with which other classic 
pruning techniques can be incorporated. This could lead to larger performance improvements in domains 
where forward pruning techniques such as Null-move Pruning or Multicut Pruning are known to work well. 
For future work, we plan to investigate the effect of dynamically set sample widths, sampling without 
replacement, and the effect of different time limits. In addition, as the sampling introduces variance, the 
variance reduction techniques used in MCTS [27] may help in improving the accuracy of the estimates. Finally, 
we would like to determine the playing strength of MCMS algorithms against known AI techniques for these 
games lHOlEfl. 
A Proof of Theorem 1

In this section, we prove Theorem 1.

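Lemma A.1. For all states $s \in \mathcal{S}$, for all actions $a \in \mathcal{A}$, for all $\lambda \in (0, 2v_{\max}] \subset \mathbb{R}$, for all $c \in \mathbb{N}$, given a 
set $C(s)$ of $c \in \mathbb{N}$ states generated according to $\mathcal{P}(\cdot \mid s, a)$, we have 

\[
P\left( \left| \frac{1}{c} \sum_{s_i \in C(s)} V_{d-1}(s_i) - V_d(s,a) \right| > \lambda \right) \le 2 \exp\left\{ -\lambda^2 c / 2 v_{\max}^2 \right\}. \tag{2}
\]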



Proof. First note that $v_{\min} \le V_d(s) \le v_{\max}$, and since each game is zero-sum, $v_{\min} = -v_{\max}$. Also, clearly 
$\mathbb{E}_{s' \sim \mathcal{P}(\cdot \mid s, a)}[V_{d-1}(s')] = V_d(s,a)$ by definition. This lets us use a special case of Hoeffding's Inequality, 
implied by [13, Theorem 2], which states that for an independent and identically distributed random sample 
$X_1, \ldots, X_c$ it holds that 

\[
P\left( \left| \frac{1}{c} \sum_{i=1}^{c} X_i - \mathbb{E}[X] \right| > \lambda \right) \le 2 \exp\left\{ -2 \lambda^2 c^2 / \left( c\,(b - a)^2 \right) \right\} \tag{3}
\]

provided $a \le X_i \le b$. Applying this bound, setting $b - a$ to $2 v_{\max}$ and simplifying finishes the proof. □ 
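
As an informal sanity check of Lemma A.1, the following minimal sketch compares the bound in Equation (2) with an empirical deviation rate. The uniform distribution of successor values on $[-v_{\max}, v_{\max}]$ and the helper names `hoeffding_bound` and `empirical_deviation_rate` are assumptions made purely for illustration; they are not part of the paper.

```python
import math
import random


def hoeffding_bound(lam, c, v_max):
    """Right-hand side of Lemma A.1: 2 * exp(-lam^2 * c / (2 * v_max^2))."""
    return 2.0 * math.exp(-(lam ** 2) * c / (2.0 * v_max ** 2))


def empirical_deviation_rate(lam, c, v_max, trials=5000, seed=0):
    """Fraction of trials in which the mean of c samples is off by more than lam.

    Illustrative assumption: successor values are uniform on [-v_max, v_max],
    so their true expectation is 0.
    """
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        mean = sum(rng.uniform(-v_max, v_max) for _ in range(c)) / c
        if abs(mean) > lam:
            exceed += 1
    return exceed / trials


if __name__ == "__main__":
    # Print the empirical rate next to the bound for a few sample widths.
    for c in (5, 20, 80):
        print(c, empirical_deviation_rate(0.5, c, 1.0), hoeffding_bound(0.5, c, 1.0))
```

Under these assumptions the empirical rate should sit well below the bound, since Hoeffding's inequality holds for every distribution supported on the interval and is therefore loose for any particular one.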
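Proposition 1. For a state $s \in \mathcal{S}$, $\forall d \in \mathbb{N}$, if $|\hat{V}_d(s,a) - V_d(s,a)| \le \lambda$ holds $\forall a \in \mathcal{A}$, then $|\hat{V}_d(s) - V_d(s)| \le \lambda$. 

Proof. Recall, $V_d(s) := \max_{a \in \mathcal{A}} V_d(s,a)$ and $\hat{V}_d(s) := \max_{a \in \mathcal{A}} \hat{V}_d(s,a)$ for any state $s \in \mathcal{S}$. Also define 
$a^* := \operatorname{argmax}_{a \in \mathcal{A}} V_d(s,a)$ and $\hat{a}^* := \operatorname{argmax}_{a \in \mathcal{A}} \hat{V}_d(s,a)$. Now, it holds that 

\[
\hat{V}_d(s) - V_d(s) = \hat{V}_d(s, \hat{a}^*) - V_d(s) \le [V_d(s, \hat{a}^*) + \lambda] - V_d(s) \le [V_d(s, a^*) + \lambda] - V_d(s) = \lambda,
\]

and also 

\[
\hat{V}_d(s) - V_d(s) = \hat{V}_d(s, \hat{a}^*) - V_d(s) \ge \hat{V}_d(s, a^*) - V_d(s) \ge [V_d(s, a^*) - \lambda] - V_d(s) = -\lambda,
\]

hence 

\[
\left| \hat{V}_d(s) - V_d(s) \right| \le \lambda.
\]

□ 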
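
Proposition 1 can be checked numerically on random instances. The sketch below is only an illustration: the helper `max_gap`, the value range, and the perturbation scheme are assumptions introduced here, not part of the paper.

```python
import random


def max_gap(true_values, estimates):
    """Return (largest per-action error, error of the resulting max values)."""
    per_action = max(abs(e - t) for t, e in zip(true_values, estimates))
    at_node = abs(max(estimates) - max(true_values))
    return per_action, at_node


rng = random.Random(1)
for _ in range(1000):
    n_actions = rng.randint(1, 6)
    lam = rng.uniform(0.0, 1.0)
    # Hypothetical true chance-node values and estimates perturbed by at most lam.
    true_values = [rng.uniform(-1.0, 1.0) for _ in range(n_actions)]
    estimates = [v + rng.uniform(-lam, lam) for v in true_values]
    per_action, at_node = max_gap(true_values, estimates)
    # Proposition 1: the decision-node error never exceeds the worst per-action error.
    assert at_node <= per_action + 1e-12
```

The check mirrors the proof: taking a maximum cannot amplify the worst error among the values being maximized.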



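Theorem 1. (Theorem from the main paper) Given $c \in \mathbb{N}$, for any state $s \in \mathcal{S}$, for all $\lambda \in (0, 2v_{\max}] \subset \mathbb{R}$, 
for any depth $d \in \mathbb{Z}_+$, 

\[
P\left( \left| \hat{V}_d(s) - V_d(s) \right| \le \lambda d \right) \ge 1 - (2c|\mathcal{A}|)^d \exp\left\{ -\lambda^2 c / 2 v_{\max}^2 \right\}.
\]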
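
To get a feel for how the guarantee behaves, the bound can be evaluated numerically. The following is a minimal sketch; the function names `theorem_lower_bound` and `sufficient_width`, the clipping at zero, and the search over powers of two are illustrative choices rather than anything prescribed by the paper.

```python
import math


def theorem_lower_bound(c, num_actions, d, lam, v_max):
    """Evaluate 1 - (2 c |A|)^d * exp(-lam^2 c / (2 v_max^2)), clipped below at 0.

    The failure term is computed in log space to avoid overflow for large d.
    """
    log_failure = d * math.log(2 * c * num_actions) - lam ** 2 * c / (2 * v_max ** 2)
    if log_failure >= 0:
        return 0.0  # the bound is vacuous for these parameters
    return 1.0 - math.exp(log_failure)


def sufficient_width(num_actions, d, lam, v_max, delta, c_max=10 ** 9):
    """Smallest power-of-two width c whose guarantee is at least 1 - delta.

    Returns None if no power of two up to c_max suffices.
    """
    c = 1
    while c <= c_max:
        if theorem_lower_bound(c, num_actions, d, lam, v_max) >= 1 - delta:
            return c
        c *= 2
    return None


if __name__ == "__main__":
    # Example: |A| = 2, depth 1, lambda = v_max = 1, failure probability 0.05.
    print(sufficient_width(2, 1, 1.0, 1.0, 0.05))
```

Note that the tolerated error in the theorem is $\lambda d$, so for a fixed overall tolerance the per-level $\lambda$ shrinks with depth and the sufficient width grows accordingly.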

Proof. We will use an inductive argument. First note that the base case is trivially satisfied for $d = 0$, since 
$\hat{V}_0(s) = V_0(s)$ for all $s \in \mathcal{S}$ by definition. Now, assume that the statement is true for some $d - 1 \in \mathbb{Z}_+$, i.e., 

\[
P\left( \left| \hat{V}_{d-1}(s) - V_{d-1}(s) \right| \le \lambda (d - 1) \right) \ge 1 - (2c|\mathcal{A}|)^{d-1} \exp\left\{ -\lambda^2 c / 2 v_{\max}^2 \right\}. \tag{4}
\]



Next we bound the error for each state-action estimate $\hat{V}_d(s,a)$. We denote by $C(s) \subseteq \mathcal{S}$ the set of $c \in \mathbb{N}$ 
successor states drawn from $\mathcal{P}(\cdot \mid s, a)$. So, 

\begin{align*}
\left| \hat{V}_d(s,a) - V_d(s,a) \right|
&= \left| \frac{1}{c} \sum_{s_i \in C(s)} \hat{V}_{d-1}(s_i) - V_d(s,a) \right| \\
&\le \frac{1}{c} \sum_{s_i \in C(s)} \left| \hat{V}_{d-1}(s_i) - V_{d-1}(s_i) \right|
 + \left| \frac{1}{c} \sum_{s_i \in C(s)} V_{d-1}(s_i) - V_d(s,a) \right|. \tag{5}
\end{align*}

The first step follows from the definition of $\hat{V}_d(s,a)$. The final step follows from the fact that $|a - b| \le 
|a - c| + |c - b|$, and simplifying. The RHS of Equation (5) consists of a sum of two terms, which we analyze 
in turn. The first term 

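\[
\frac{1}{c} \sum_{s_i \in C(s)} \left| \hat{V}_{d-1}(s_i) - V_{d-1}(s_i) \right|
\]

is the average of the error in $c$ state value estimates at level $d - 1$. Now, the event that 

\[
\frac{1}{c} \sum_{s_i \in C(s)} \left| \hat{V}_{d-1}(s_i) - V_{d-1}(s_i) \right| > \lambda (d - 1)
\]

is a subset of the event that at least one estimate is off by more than $\lambda (d - 1)$. Therefore, we have 

\begin{align*}
P\left( \frac{1}{c} \sum_{s_i \in C(s)} \left| \hat{V}_{d-1}(s_i) - V_{d-1}(s_i) \right| > \lambda (d - 1) \right)
&\le P\left( \bigcup_{s_i \in C(s)} \left| \hat{V}_{d-1}(s_i) - V_{d-1}(s_i) \right| > \lambda (d - 1) \right) \\
&\le \sum_{s_i \in C(s)} P\left( \left| \hat{V}_{d-1}(s_i) - V_{d-1}(s_i) \right| > \lambda (d - 1) \right) \\
&\le c \, (2c|\mathcal{A}|)^{d-1} \exp\left\{ -\lambda^2 c / 2 v_{\max}^2 \right\}. \tag{6}
\end{align*}

The penultimate line follows from the union bound. The final line applies the inductive hypothesis. 

We now consider the second term 

\[
\left| \frac{1}{c} \sum_{s_i \in C(s)} V_{d-1}(s_i) - V_d(s,a) \right|
\]

of the RHS of Equation (5). By Lemma A.1 we have 

\[
P\left( \left| \frac{1}{c} \sum_{s_i \in C(s)} V_{d-1}(s_i) - V_d(s,a) \right| > \lambda \right) \le 2 \exp\left\{ -\lambda^2 c / 2 v_{\max}^2 \right\}. \tag{7}
\]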



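We now have a bound for each of the two terms in the RHS of Equation (5), as well as the probability 
with which that bound is exceeded. Notice that the value of the RHS can exceed the sum of the two terms' 
bounds only if at least one term exceeds its respective bound. Using this, we get 

\begin{align*}
& P\left( \frac{1}{c} \sum_{s_i \in C(s)} \left| \hat{V}_{d-1}(s_i) - V_{d-1}(s_i) \right|
 + \left| \frac{1}{c} \sum_{s_i \in C(s)} V_{d-1}(s_i) - V_d(s,a) \right| > \lambda d \right) \\
&\quad \le P\left( \frac{1}{c} \sum_{s_i \in C(s)} \left| \hat{V}_{d-1}(s_i) - V_{d-1}(s_i) \right| > \lambda (d - 1) \right)
 + P\left( \left| \frac{1}{c} \sum_{s_i \in C(s)} V_{d-1}(s_i) - V_d(s,a) \right| > \lambda \right)
\end{align*}

by the union bound and the fact that if $x + y > K$ then either $x > k_1$ or $y > k_2$, where $k_1 + k_2 = K$. 
Specifically, the event on the left-hand side is a subset of the union of the two events on the right-hand side. 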

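Continuing, we can apply Equations (6) and (7) to the above to get an upper bound of 

\begin{align*}
& c \, (2c|\mathcal{A}|)^{d-1} \exp\left\{ -\lambda^2 c / 2 v_{\max}^2 \right\} + 2 \exp\left\{ -\lambda^2 c / 2 v_{\max}^2 \right\} \\
&\quad = \left( 2 + c \, (2c|\mathcal{A}|)^{d-1} \right) \exp\left\{ -\lambda^2 c / 2 v_{\max}^2 \right\} \\
&\quad \le (2c)^d \, |\mathcal{A}|^{d-1} \exp\left\{ -\lambda^2 c / 2 v_{\max}^2 \right\}, \tag{8}
\end{align*}

where the final two lines follow from standard calculations and the fact that $c > 1$. 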
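
For completeness, the last inequality can be spelled out as follows. The shorthand $B$ is introduced here only for this check, and the calculation assumes $d \ge 1$ and $|\mathcal{A}| \ge 1$. Writing $B := c \, (2c|\mathcal{A}|)^{d-1}$, an integer $c > 1$ gives $B \ge c \ge 2$, and therefore

```latex
\[
(2c)^d \, |\mathcal{A}|^{d-1}
  = 2c \, (2c|\mathcal{A}|)^{d-1}
  = 2B
  = B + B
  \ge 2 + B
  = 2 + c \, (2c|\mathcal{A}|)^{d-1}.
\]
```

The case $d = 1$ is where $c > 1$ is actually needed: there $B = c$, and the inequality $2 + c \le 2c$ fails for $c = 1$.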
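Recall from Equation (5) that 

\[
\left| \hat{V}_d(s,a) - V_d(s,a) \right| \le \frac{1}{c} \sum_{s_i \in C(s)} \left| \hat{V}_{d-1}(s_i) - V_{d-1}(s_i) \right|
 + \left| \frac{1}{c} \sum_{s_i \in C(s)} V_{d-1}(s_i) - V_d(s,a) \right|.
\]

This allows us to bound the probability that $\hat{V}_d(s,a)$ differs from $V_d(s,a)$ by more than $\lambda d$, by using Equation 
(8) as follows 

\begin{align*}
P\left( \left| \hat{V}_d(s,a) - V_d(s,a) \right| > \lambda d \right)
&\le P\left( \frac{1}{c} \sum_{s_i \in C(s)} \left| \hat{V}_{d-1}(s_i) - V_{d-1}(s_i) \right|
 + \left| \frac{1}{c} \sum_{s_i \in C(s)} V_{d-1}(s_i) - V_d(s,a) \right| > \lambda d \right) \\
&\le (2c)^d \, |\mathcal{A}|^{d-1} \exp\left\{ -\lambda^2 c / 2 v_{\max}^2 \right\}. \tag{9}
\end{align*}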



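We know from Proposition 1 that if all of the chance node value estimates are accurate then the decision 
node estimate must also be accurate. This allows us to consider the probability of the event that at least one 
of the chance node value estimates $\hat{V}_d(s,a)$ deviates by more than $\lambda d$, that is, 

\[
P\left( \bigcup_{a \in \mathcal{A}} \left| \hat{V}_d(s,a) - V_d(s,a) \right| > \lambda d \right)
\le \sum_{a \in \mathcal{A}} P\left( \left| \hat{V}_d(s,a) - V_d(s,a) \right| > \lambda d \right)
\]

by the union bound. Applying Equation (9) we get 

\[
P\left( \bigcup_{a \in \mathcal{A}} \left| \hat{V}_d(s,a) - V_d(s,a) \right| > \lambda d \right)
\le (2c|\mathcal{A}|)^d \exp\left\{ -\lambda^2 c / 2 v_{\max}^2 \right\},
\]

hence 

\[
1 - P\left( \bigcup_{a \in \mathcal{A}} \left| \hat{V}_d(s,a) - V_d(s,a) \right| > \lambda d \right)
\ge 1 - (2c|\mathcal{A}|)^d \exp\left\{ -\lambda^2 c / 2 v_{\max}^2 \right\}.
\]

Then by De Morgan's law, we have 

\[
P\left( \bigcap_{a \in \mathcal{A}} \left| \hat{V}_d(s,a) - V_d(s,a) \right| \le \lambda d \right)
\ge 1 - (2c|\mathcal{A}|)^d \exp\left\{ -\lambda^2 c / 2 v_{\max}^2 \right\},
\]

which combined with Proposition 1 implies that 

\[
P\left( \left| \hat{V}_d(s) - V_d(s) \right| \le \lambda d \right)
\ge 1 - (2c|\mathcal{A}|)^d \exp\left\{ -\lambda^2 c / 2 v_{\max}^2 \right\},
\]

which proves the inductive step. 

□ 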
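
The estimator analysed in this proof can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the paper's implementation: it follows only the recursive definitions used in the appendix (a sample average at chance nodes, a maximum at decision nodes, and the exact value at depth 0), it omits the minimizing player and all pruning, and the toy model (`toy_actions`, `toy_sample_next`, `toy_leaf_value`) is hypothetical.

```python
import random


def sparse_value(state, d, c, actions, sample_next, leaf_value, rng):
    """Sparse-sampling estimate of V_d(state) for the maximizing player."""
    if d == 0:
        return leaf_value(state)  # base case: the estimate equals V_0(s)
    best = float("-inf")
    for action in actions(state):
        total = 0.0
        for _ in range(c):  # c successors drawn with replacement
            successor = sample_next(state, action, rng)
            total += sparse_value(successor, d - 1, c, actions,
                                  sample_next, leaf_value, rng)
        best = max(best, total / c)  # chance-node average, then max over actions
    return best


# Hypothetical toy model (not from the paper): the state is a running total.
def toy_actions(state):
    return ("hold", "roll")


def toy_sample_next(state, action, rng):
    if action == "hold":
        return state
    die = rng.randint(1, 6)
    return 0 if die == 1 else state + die  # rolling a 1 loses the total


def toy_leaf_value(state):
    return float(state)


if __name__ == "__main__":
    rng = random.Random(0)
    print(sparse_value(10, 1, 200, toy_actions, toy_sample_next, toy_leaf_value, rng))
```

The recursion makes $(c|\mathcal{A}|)^d$ leaf evaluations regardless of how many distinct successors each chance node has, which is the property that makes sparse sampling attractive for densely stochastic games.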